ABSTRACT

In this paper, we address the problem of evaluating his- 
torical queries on graphs. To this end, we investigate the 
use of graph deltas, i.e., a log of time-annotated graph op- 
erations. Our storage model maintains the current graph 
snapshot and the delta. We reconstruct past snapshots by 
applying appropriate parts of the graph delta on the current 
snapshot. Query evaluation proceeds on the reconstructed 
snapshots but we also propose algorithms based mostly on 
deltas for efficiency. We introduce various techniques for 
improving performance, including materializing intermedi- 
ate snapshots, partial reconstruction and indexing deltas. 



1. INTRODUCTION 

In recent years, there has been increased interest in graph 
structures representing real-world networks such as social 
networks, citation and hyperlink networks as well as biology 
and computer networks. In this paper, we focus on social
network graphs. Such graphs are characterized by their large
scale, since the number of participating nodes reaches millions.
Social graphs are also highly dynamic, since the corresponding
social networks constantly evolve through time.

An interesting problem in this setting is supporting his- 
torical queries. By historical queries, we refer to queries 
that involve the state of the graph at any time interval in 
the past. For instance, consider queries about the popular- 
ity (e.g., number of friends) of a user at some specific time 
in the past, about how this popularity changed over time 
as well as queries about the diameter of a network over a 
time period. Historical queries are important when study- 
ing graphs that change through time for various applications 
such as version maintenance and monitoring and analyzing 
the evolution of the graph. 

However, most recent research mainly addresses the problems
introduced by the large scale of social graphs [4] [5] [7]
and ignores the temporal aspects by focusing on queries that
involve only the current graph snapshot. In this paper, we
introduce a framework for supporting historical queries that 
involve one or more graph snapshots. To this end, we pro- 
pose a general model for incorporating information about 
how a graph changes through time based on graph deltas. 
A graph delta is a log of time-annotated graph update operations
such as the addition and removal of nodes and edges.
At each time instant, we store a current snapshot of the 
graph plus the graph delta that records the changes that 
have occurred. By applying the graph delta on the snapshot, 
any past snapshot can be reconstructed. We also discuss ma- 
terializing intermediate snapshots to improve performance. 

The evaluation of historical queries is based on a two-phase
plan that first reconstructs the snapshot or snapshots
that are required to evaluate the query and then evaluates
the query on them. We address the high cost of snapshot
reconstruction by proposing query plans that rely only or
mostly on the delta when this is possible.
Furthermore, for node-centric queries, i.e., queries that ac- 
cess only parts of the graph, we introduce partial snapshot 
reconstruction that constructs only the subgraph required to 
evaluate the query. Finally, we show how building indexes 
on deltas can further improve efficiency. 

The rest of this paper is organized as follows. Section 2 
introduces our model for storing time-evolving graphs. Sec- 
tion 3 presents a classification of historical queries and query 
plans for their evaluation. Section 4 reports experimental re- 
sults. Section 5 summarizes related work, while Section 6 
concludes the paper. 

2. MODEL 

We model a social network as an undirected graph, $G = (V, E)$.
Each graph node $v_i \in V$ corresponds to a user $u_i$
of the social network. Edges $(v_i, v_j) \in E$ capture social
relationships (i.e., friendship) between users $u_i$ and $u_j$ that
correspond to nodes $v_i$ and $v_j \in V$, respectively.

Note that our model supports symmetrical social rela- 
tionships between users, such as friendship in Facebook. If 
we consider asymmetrical relationships such as the ones in 
Twitter, then the graph representing the network is directed 
as the edges capturing the "follower" and "following" relationships
in the network are also directed, i.e., $u_i$ can "follow" $u_j$,
while $u_j$ does not "follow" $u_i$. In the rest of the
paper, we focus on undirected graphs but our algorithms
can be easily adapted to account for directed graphs.

Our model for capturing the evolution of the social net- 
work through time is based on the use of graph snapshots 
and graph deltas. 

2.1 Snapshots and Deltas 

We consider an element, node or edge, of a graph G as 
valid for the time periods for which the corresponding item 
(user or friendship) of the social network it represents is 
also valid. Each node $v_i \in V$ is valid for the time periods
for which the corresponding user $u_i$ participates in the social
network represented by the graph. Similarly, each edge
$(v_i, v_j) \in E$ is valid for the time periods that the corresponding
users $u_i$ and $u_j$ are friends in the network.

Definition 1 (Graph Snapshot). A graph snapshot
of a graph $G$, at a time point $t$, is defined as the graph
$SG_t = (V', E')$, where $V' \subseteq V$ and $E' \subseteq E$, such that
$v_i \in V'$ if and only if $v_i$ is valid at time point $t$, and
$(v_i, v_j) \in E'$ if and only if $(v_i, v_j)$ is valid at time point $t$.

Graph G captures the social network as it evolves. Any 
update in the social network is directly reflected on $G$. A
graph snapshot $SG_t$ of $G$ can be simply viewed as an instance
of $G$ frozen at time point $t$, capturing the state of $G$
at this specific time point.

We focus on the structure of the social network and thus
consider the following four basic update operations that affect
its structure: (1) the addition of a new user $u_i$ in the
social network, (2) the creation of a new friendship relationship
between two users $u_i$ and $u_j$ that were not friends,
(3) the removal of an existing user $u_i$ and (4) the deletion
of an existing friendship relationship between two users $u_i$
and $u_j$ that were friends. The corresponding operations in
$G = (V, E)$ are:

1. $addNode(v_i)$ that adds a new node $v_i$ in $V$.

2. $addEdge(v_i, v_j)$ that creates a new edge $(v_i, v_j)$ between
$v_i$ and $v_j$ in $E$.

3. $remNode(v_i)$ that deletes $v_i$ from $V$ and all edges that
involve $v_i$ from $E$.

4. $remEdge(v_i, v_j)$ that deletes edge $(v_i, v_j)$ from $E$.
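The four operations can be sketched in Python as follows. This is only an illustrative sketch: the class name `Graph`, the method names and the choice of storing each undirected edge as a `frozenset` are assumptions made here, not part of the model above.

```python
class Graph:
    """Undirected graph supporting the four structural update operations."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()  # each edge is a frozenset {v_i, v_j}

    def add_node(self, v):
        self.nodes.add(v)

    def add_edge(self, vi, vj):
        if vi not in self.nodes or vj not in self.nodes:
            raise KeyError("both endpoints must exist")
        self.edges.add(frozenset((vi, vj)))

    def rem_node(self, v):
        # Removing a node also removes all edges that involve it.
        self.edges = {e for e in self.edges if v not in e}
        self.nodes.discard(v)

    def rem_edge(self, vi, vj):
        self.edges.discard(frozenset((vi, vj)))


g = Graph()
for v in ("a", "b", "c"):
    g.add_node(v)
g.add_edge("a", "b")
g.add_edge("b", "c")
g.rem_node("b")  # also drops edges (a, b) and (b, c)
```

After the last call, the graph holds nodes `a` and `c` and no edges, which mirrors the definition of $remNode$.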

Given two graph snapshots $SG_{t_k}$ and $SG_{t_l}$ of a graph
$G$, we maintain in deltas the operations that, if applied to
$SG_{t_k}$, produce $SG_{t_l}$. In the spirit of [8], let us first consider
deltas as sets.

Definition 2 (Delta). Given two snapshots $SG_{t_k} = (V_k, E_k)$
and $SG_{t_l} = (V_l, E_l)$ of a graph $G$, delta $\Delta_{t_k,t_l}$ is a
set of operations with the following properties:

1. $\forall v_i$, s.t. $v_i \notin V_k$ and $v_i \in V_l$, $addNode(v_i) \in \Delta_{t_k,t_l}$.

2. $\forall v_i$, s.t. $v_i \in V_k$ and $v_i \notin V_l$, $remNode(v_i) \in \Delta_{t_k,t_l}$.

3. $\forall (v_i, v_j)$, s.t. $(v_i, v_j) \notin E_k$ and $(v_i, v_j) \in E_l$,
$addEdge(v_i, v_j) \in \Delta_{t_k,t_l}$.

4. $\forall (v_i, v_j)$, s.t. $(v_i, v_j) \in E_k$, $(v_i, v_j) \notin E_l$, $v_i \in V_l$ and
$v_j \in V_l$, $remEdge(v_i, v_j) \in \Delta_{t_k,t_l}$.

These are the only operations that appear in $\Delta_{t_k,t_l}$.

$\Delta_{t_k,t_l}$ is the unique minimal set needed for deriving $SG_{t_l}$
from $SG_{t_k}$, since it does not contain any redundant operations
and all its operations are necessary for producing $SG_{t_l}$.

Lemma 1. $\Delta_{t_k,t_l}$ is unique and minimal.
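To make the set-based definition concrete, the sketch below computes such a delta from two snapshots with plain set differences. The function name `minimal_delta` and the snapshot encoding (a pair of a node set and a set of `frozenset` edges) are assumptions chosen for illustration.

```python
def minimal_delta(snap_k, snap_l):
    """Return the set-based delta that turns snapshot k into snapshot l.

    A snapshot is a pair (nodes, edges); each edge is a frozenset of two nodes.
    """
    (v_k, e_k), (v_l, e_l) = snap_k, snap_l
    delta = set()
    delta |= {("addNode", v) for v in v_l - v_k}
    delta |= {("remNode", v) for v in v_k - v_l}
    delta |= {("addEdge", e) for e in e_l - e_k}
    # remEdge is listed only when both endpoints survive in snapshot l;
    # otherwise the edge disappears as a side effect of remNode.
    delta |= {("remEdge", e) for e in e_k - e_l if e <= v_l}
    return delta


sg_k = ({"a", "b", "c"}, {frozenset("ab"), frozenset("bc")})
sg_l = ({"a", "b", "d"}, {frozenset("bd")})
delta = minimal_delta(sg_k, sg_l)
```

In this example the delta contains four operations: one $addNode$, one $remNode$, one $addEdge$ and a single $remEdge$ for edge (a, b). Edge (b, c) needs no $remEdge$ because node c is removed.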

Applying deltas on graph snapshots is denoted using $\circ$:
$\Delta_{t_k,t_l} \circ SG_{t_k} = SG_{t_l}$. We say that a delta $\Delta_{t_k,t_l}$ is a
forward delta if $t_k < t_l$, i.e., when it includes operations to be
applied on an older snapshot to create a more recent one.

In our current approach, we record all update operations.
Thus, our deltas may contain redundant operations. For example,
consider an edge $(v_i, v_j)$ that represents a friendship
relationship created between $u_i$ and $u_j$ and later deleted.
We maintain both corresponding $addEdge$ and $remEdge$ operations,
since we want to be able to retrieve all snapshots,
including the one when $(v_i, v_j)$ was valid. We maintain such
deltas as sets of operations annotated with the time point
at which the operation occurred. Updates are recorded as
they happen in the social network, "forward" in time. We
call such deltas Interval Deltas.

Definition 3 (Interval Delta). For a graph $G$ and
a time interval $[t_0, t_{cur}]$, an interval delta $\Delta_{[t_0,t_{cur}]}$ is a set
of pairs $(op, t)$, such that a pair $(op, t) \in \Delta_{[t_0,t_{cur}]}$ if and
only if operation $op$ appeared in $G$ at time point $t \in [t_0, t_{cur}]$.

In the rest of this paper, we refer to interval deltas as 
deltas for simplicity as this is the only type of deltas we use 



Algorithm 1 ForRec($SG_{t_0}$, $\Delta_{[t_0,t_{cur}]}$, $t'$)

Input: $SG_{t_0}$, $\Delta_{[t_0,t_{cur}]}$, $t' \in [t_0, t_{cur}]$
Output: $SG_{t'}$

$t := t_0$
copy $SG_{t_0}$ to $SG_{t'}$
Start from the beginning of $\Delta_{[t_0,t_{cur}]}$
while $t < t'$ do
    Read next operation $op$ and its time $t$ in $\Delta_{[t_0,t_{cur}]}$
    if $op = addNode(v_i)$ then
        add new node $v_i$ in $SG_{t'}$
    else if $op = addEdge(v_i, v_j)$ then
        Find $v_i$ and $v_j$ in $SG_{t'}$
        add new edge $(v_i, v_j)$
    else if $op = remNode(v_i)$ then
        Find $v_i$ in $SG_{t'}$
        Remove $v_i$ from $SG_{t'}$
    else
        Find $v_i$ and $v_j$ in $SG_{t'}$
        remove edge $(v_i, v_j)$
    end if
end while
return $SG_{t'}$



in our approach. Since we record all update operations in 
the time interval, our deltas are not minimal. As explained, 
redundant information is required for being able to retrieve 
a snapshot for any time point in the interval. Formally, we 
want to ensure that our deltas are complete. 

Definition 4 (Complete Delta). A delta $\Delta_{[t_0,t_{cur}]}$
is complete if, given the graph snapshot $SG_{t_0}$, we can derive
any snapshot $SG_{t'}$, $t' \in [t_0, t_{cur}]$, by applying the operations
$op$ of $\Delta_{[t_0,t_{cur}]}$ for which $t < t'$, that is, if we apply
$\Delta_{[t_0,t']} \subseteq \Delta_{[t_0,t_{cur}]}$:

$\Delta_{[t_0,t']} \circ SG_{t_0} = SG_{t'}$

For complete deltas, to reconstruct any snapshot, we just
need an initial snapshot and the delta. Algorithm 1 presents
the reconstruction process, assuming for simplicity that
operations in the deltas are ordered by time.
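
Forward reconstruction can be sketched in Python as shown below. The encoding of a delta as a time-ordered list of `(op, args, t)` tuples, the helper `apply_op` and the convention that the snapshot at $t'$ reflects every operation stamped at or before $t'$ are assumptions made for this illustration.

```python
def apply_op(nodes, edges, op, args):
    """Apply one update operation in place to a (nodes, edges) snapshot."""
    if op == "addNode":
        nodes.add(args[0])
    elif op == "addEdge":
        edges.add(frozenset(args))
    elif op == "remNode":
        nodes.discard(args[0])
        edges -= {e for e in edges if args[0] in e}
    elif op == "remEdge":
        edges.discard(frozenset(args))


def for_rec(initial, delta, t_prime):
    """Rebuild the snapshot at t_prime from the initial snapshot."""
    nodes, edges = set(initial[0]), set(initial[1])  # work on a copy
    for op, args, t in delta:  # delta is assumed to be sorted by time
        if t > t_prime:
            break
        apply_op(nodes, edges, op, args)
    return nodes, edges


delta = [
    ("addNode", ("a",), 1),
    ("addNode", ("b",), 2),
    ("addEdge", ("a", "b"), 3),
    ("remEdge", ("a", "b"), 5),
]
sg_4 = for_rec((set(), set()), delta, 4)  # edge (a, b) still present
sg_5 = for_rec((set(), set()), delta, 5)  # edge (a, b) removed
```

Because both the $addEdge$ and the later $remEdge$ are kept in the delta, the snapshot at time 4 can still be recovered, which is the reason the interval delta is not minimal.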
Inverted Deltas. So far we have considered only a forward
application of deltas. Let us now consider the case where we
want to move "backwards" in time. That is, given a snapshot
$SG_{t_k}$ at $t_k$, we want to retrieve a snapshot $SG_{t_l}$ at $t_l$, where
$t_l < t_k$. To achieve this, we define an inverted delta and
apply this delta on $SG_{t_k}$.

Definition 5 (Inverted Delta ($\bar{\Delta}$)). Given a graph
snapshot $SG_{t_{cur}}$ and $\Delta_{[t_0,t_{cur}]}$, we define the inverted delta,
$\bar{\Delta}_{[t_0,t_{cur}]}$, to be the set of operations such that:

for each $t' \in [t_0, t_{cur}]$, $\bar{\Delta}_{[t',t_{cur}]} \circ SG_{t_{cur}} = SG_{t'}$
and $\bar{\Delta}_{[t',t_{cur}]} \subseteq \bar{\Delta}_{[t_0,t_{cur}]}$.

To invert our deltas, we apply the reverse operation for
each of the operations they include. In particular:

1. $\overline{addNode(v_i)} = remNode(v_i)$.

2. $\overline{addEdge(v_i, v_j)} = remEdge(v_i, v_j)$.

3. $\overline{remNode(v_i)} = addNode(v_i)$.

4. $\overline{remEdge(v_i, v_j)} = addEdge(v_i, v_j)$.

All operations can be inverted as long as the necessary information
is maintained in the forward delta. In particular, to
maintain a complete delta that is also invertible, we make the
following assumption. Before recording any $remNode(v_i)$
in the delta, we first record $remEdge(v_i, v_j)$ operations, for
each edge of $v_i$, annotated with the same time point as the
$remNode(v_i)$ operation.

Algorithm 2 presents the backward reconstruction procedure
that, given a graph snapshot, derives a previous one by
inverting the delta file.



Algorithm 2 BackRec($SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $t'$)

Input: $SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $t' \in [t_0, t_{cur}]$
Output: $SG_{t'}$

$t := t_{cur}$
copy $SG_{t_{cur}}$ to $SG_{t'}$
Open $\bar{\Delta}_{[t_0,t_{cur}]}$ and start reading from its beginning
while $t > t'$ do
    Read next operation $op$ and its time $t$ in $\bar{\Delta}_{[t_0,t_{cur}]}$
    Apply $op$ at $SG_{t'}$
end while
return $SG_{t'}$
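A minimal Python sketch of backward reconstruction follows. It assumes a delta stored as a time-ordered list of `(op, args, t)` tuples and obtains the inverted delta by reading that list in reverse and swapping each operation for its inverse; the names `INVERSE`, `apply_op` and `back_rec` are hypothetical.

```python
INVERSE = {
    "addNode": "remNode",
    "remNode": "addNode",
    "addEdge": "remEdge",
    "remEdge": "addEdge",
}


def apply_op(nodes, edges, op, args):
    """Apply one update operation in place to a (nodes, edges) snapshot."""
    if op == "addNode":
        nodes.add(args[0])
    elif op == "addEdge":
        edges.add(frozenset(args))
    elif op == "remNode":
        nodes.discard(args[0])
        edges -= {e for e in edges if args[0] in e}
    elif op == "remEdge":
        edges.discard(frozenset(args))


def back_rec(current, delta, t_prime):
    """Rebuild the snapshot at t_prime by undoing every later operation."""
    nodes, edges = set(current[0]), set(current[1])  # work on a copy
    for op, args, t in reversed(delta):  # newest operation first
        if t <= t_prime:
            break
        apply_op(nodes, edges, INVERSE[op], args)
    return nodes, edges


delta = [
    ("addNode", ("a",), 1),
    ("addNode", ("b",), 1),
    ("addEdge", ("a", "b"), 2),
    ("remEdge", ("a", "b"), 4),  # recorded before remNode, same time point
    ("remNode", ("b",), 4),
]
current = ({"a"}, set())
sg_3 = back_rec(current, delta, 3)
sg_1 = back_rec(current, delta, 1)
```

The example relies on the invertibility assumption stated above: since $remEdge$ is logged before $remNode$, reading backwards restores node b first and its edge second.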



2.2 Storage and Maintenance 

We maintain forward, complete and invertible deltas. Let 
us now discuss issues regarding the efficient reconstruction 
of snapshots using such deltas. 

Theorem 1. For a graph $G$, given a delta $\Delta_{[t_0,t_{cur}]}$, if
the delta is complete and invertible, to reconstruct a graph
snapshot of $G$ at any time point $t \in [t_0, t_{cur}]$, it suffices to
maintain only one graph snapshot.

Proof. Let $SG_t$, $t \in [t_0, t_{cur}]$, be the graph snapshot we
maintain. For a time point $t_k \in [t_0, t_{cur}]$ such that $t_k > t$,
we reconstruct $SG_{t_k}$ with forward reconstruction by applying
$\Delta_{[t,t_k]} \subseteq \Delta_{[t_0,t_{cur}]}$ on $SG_t$. For a time point $t_l \in [t_0, t_{cur}]$
such that $t_l < t$, we reconstruct $SG_{t_l}$ with backward reconstruction
by applying $\bar{\Delta}_{[t_l,t]} \subseteq \bar{\Delta}_{[t_0,t_{cur}]}$ on $SG_t$.

Thus, based on Theorem 1, to capture the evolution of $G$
and support historical queries, it suffices to maintain either
the original graph snapshot $SG_{t_0}$, since with forward reconstruction
we can derive any snapshot $SG_t$, or the current
graph snapshot $SG_{t_{cur}}$, since with backward reconstruction
we can again derive any $SG_t$. The only difference between
these two approaches is the cost required for reconstructing
$SG_t$.

If we assume that the cost of applying either the delta
or the inverted delta on a snapshot is the same, the main
factor that influences the reconstruction cost is the number
of operations (and their type) that we need to apply on the
given snapshot. Therefore, it is easy to see that maintaining
the original snapshot $SG_{t_0}$ is more appropriate when we expect
more queries about the distant past, while maintaining $SG_{t_{cur}}$
is more appropriate when we expect queries about the more
recent past to be more popular.

In our work, we follow the second approach, since this
approach supports queries on the current graph snapshot
more efficiently. Thus, we maintain the current snapshot
$SG_{t_{cur}}$ and $\Delta_{[t_0,t_{cur}]}$. As updates occur in $G$, we need to
update both the current snapshot and the delta. Algorithm 3
describes the update procedure. It uses an additional temporary
delta that records the updates on $G$ until the next
time unit and then applies this delta on the current snapshot
to derive the next current snapshot. The algorithm is
applied anew for the next time unit, and so on.
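
The update step described above can be sketched as follows, assuming that the updates of the next time unit arrive as a list of `(op, args, t)` tuples. The function name `update` and the helper `apply_op` are illustrative assumptions.

```python
def apply_op(nodes, edges, op, args):
    """Apply one update operation in place to a (nodes, edges) snapshot."""
    if op == "addNode":
        nodes.add(args[0])
    elif op == "addEdge":
        edges.add(frozenset(args))
    elif op == "remNode":
        nodes.discard(args[0])
        edges -= {e for e in edges if args[0] in e}
    elif op == "remEdge":
        edges.discard(frozenset(args))


def update(current, delta, new_ops):
    """Derive the next current snapshot and the extended delta."""
    temp_delta = list(new_ops)  # temporary delta for one time unit
    nodes, edges = set(current[0]), set(current[1])
    for op, args, _ in temp_delta:
        apply_op(nodes, edges, op, args)
    # The temporary delta is appended at the end of the existing delta.
    return (nodes, edges), delta + temp_delta


current = ({"a"}, set())
delta = [("addNode", ("a",), 1)]
new_ops = [("addNode", ("b",), 2), ("addEdge", ("a", "b"), 2)]
current, delta = update(current, delta, new_ops)
```

Appending keeps the delta ordered by time, which is what the reconstruction procedures assume.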
Materializing Snapshots. While maintaining a single
snapshot and the delta suffices for reconstructing any graph
snapshot in the time interval covered by the delta, such
reconstruction may not be efficient. As time progresses and
more update operations occur, deltas grow in size and applying
large parts of them to reconstruct past snapshots may
become very costly. For instance, if we want to reconstruct
the original graph snapshot, we have to apply the entire
delta file that may include update operations that have
occurred in a social network over months or even years.

To improve efficiency, we propose materializing and maintaining
intermediate graph snapshots in addition to the current
snapshot $SG_{t_{cur}}$. Let $S$ be the sequence
$SG_{t_{i_1}}, \ldots, SG_{t_{i_m}}, SG_{t_{cur}}$, $m \geq 1$, of the available materialized
snapshots. To reconstruct a snapshot $SG_{t_k}$, we would like to
start our reconstruction from the snapshot in this sequence
that would result in the most efficient reconstruction. We
consider different approaches on how to select the most
appropriate snapshot, $SG_{t_l} \in S$, for reconstructing a snapshot
$SG_{t_k}$.

Algorithm 3 Update($G$, $SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$)

Input: $G$, $SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$
Output: $SG_{t_{cur+1}}$, $\Delta_{[t_0,t_{cur+1}]}$

Initialize $\Delta'_{[t_{cur},t_{cur+1}]}$
while time point $t \in [t_{cur}, t_{cur+1}]$ do
    for all update operations $op$ on $G$ in $t \in [t_{cur}, t_{cur+1}]$ do
        Record $(op, t)$ in $\Delta'_{[t_{cur},t_{cur+1}]}$
    end for
end while
$SG_{t_{cur+1}} = \Delta'_{[t_{cur},t_{cur+1}]} \circ SG_{t_{cur}}$
Append $\Delta'_{[t_{cur},t_{cur+1}]}$ at the end of $\Delta_{[t_0,t_{cur}]}$ to get $\Delta_{[t_0,t_{cur+1}]}$
return $SG_{t_{cur+1}}$, $\Delta_{[t_0,t_{cur+1}]}$

Time-based selection. Given the sequence $S$ of materialized
snapshots, the snapshot $SG_{t_l}$ is defined as the one closest in
time to $t_k$, i.e., the one with the smallest $|t_k - t_l|$ value over
all $t_l \in [t_0, t_{cur}]$ for which we have materialized snapshots
available.

Operation-based selection. Given the sequence $S$ of materialized
snapshots, the snapshot $SG_{t_l}$ is defined as the one
for which the number of operations in the delta ($\Delta_{[t_l,t_k]}$, if $t_l < t_k$, or
$\bar{\Delta}_{[t_k,t_l]}$, if $t_l > t_k$) that need to be applied on $SG_{t_l}$ to derive
$SG_{t_k}$ is the minimum over all the deltas corresponding
to the other snapshots in $S$.

Regardless of the selection method chosen, depending on
whether $t_l < t_k$ or $t_l > t_k$, we need to apply respectively
forward or backward reconstruction using the corresponding
part of the delta.

Time-based selection can be applied more efficiently, as
we only need to determine the snapshot closest in time to
$SG_{t_k}$. However, if the update operations are not uniformly
distributed through time, as is usually the case in social
networks where bursts of activity occur often, the selected
snapshot $SG_{t_l}$ may not be the most appropriate one.

On the other hand, operation-based selection requires that
we count the operations in the corresponding
$\Delta_{[t_l,t_k]}$ if $t_l < t_k$, or $\bar{\Delta}_{[t_k,t_l]}$ if $t_l > t_k$, which induces an
additional cost. However, this selection guarantees that the
selected snapshot yields the best cost for the reconstruction
process as it requires the minimum number of operations to
be applied.
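The difference between the two selection strategies can be illustrated with the following sketch. It assumes that the time points of the materialized snapshots and the sorted timestamps of the logged operations are available as plain lists; the function names are hypothetical.

```python
from bisect import bisect_right


def time_based(snapshot_times, t_k):
    """Pick the materialized snapshot closest in time to t_k."""
    return min(snapshot_times, key=lambda t_l: abs(t_k - t_l))


def ops_between(op_times, t_a, t_b):
    """Count logged operations with a timestamp in (min, max] of the two points."""
    lo, hi = min(t_a, t_b), max(t_a, t_b)
    return bisect_right(op_times, hi) - bisect_right(op_times, lo)


def operation_based(snapshot_times, op_times, t_k):
    """Pick the snapshot that needs the fewest operations to reach t_k."""
    return min(snapshot_times, key=lambda t_l: ops_between(op_times, t_l, t_k))


snapshot_times = [0, 20]                  # materialized snapshots
op_times = [1, 2, 3, 4, 5, 6, 7, 8, 15]   # a burst of updates early on
chosen_by_time = time_based(snapshot_times, 9)
chosen_by_ops = operation_based(snapshot_times, op_times, 9)
```

For the target time 9, time-based selection picks the snapshot at time 0, which requires applying eight operations forward, whereas operation-based selection picks the snapshot at time 20, from which a single operation has to be undone.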

To facilitate this process, we may also assume that deltas
are split into disjoint intervals. In particular, along with
each snapshot $SG_{t_{i_j}}$ in the sequence of materialized snapshots,
we may maintain a delta $\Delta_{[t_{i_{j-1}},t_{i_j}]}$ reporting the
update operations from the snapshot preceding it in the sequence.

Discussion. An important issue that arises is when do 
we materialize a graph snapshot, i.e., how do we select the 
time points at which we materialize the next snapshot in the 
sequence. 

A straightforward approach is to materialize snapshots
periodically, e.g., take one snapshot per hour, day or month.
However, this solution has the same problem as the time-based
selection of snapshots, i.e., it assumes that changes in
a social graph occur uniformly through time.

Similarly to operation-based selection, an alternative approach
is to determine whether to materialize the next snapshot
or not based on the number of update operations that
have occurred. Thus, time periods with many changes would
be represented with more snapshots than time periods with
fewer changes.

Finally, snapshot materialization can be based on the sim- 
ilarity between snapshots. If two snapshots in successive 
time periods are similar, then we do not need to material- 
ize both, whereas, if they differ significantly, then we could 
materialize both. While at first, this approach seems simi- 
lar to determining the next materialized snapshot based on 
the number of update operations, they are not the same. 
A snapshot may not be very different from a previous one, 
even if many operations have occurred, if such operations 
reverse themselves, e.g. the same nodes join and leave the 
graph repeatedly. 

3. EVALUATING HISTORICAL QUERIES 

So far, we have discussed the problem of reconstructing 
snapshots, given one or more materialized graph snapshots 
and deltas. In this section, we address the problem of eval- 
uating historical queries. 

3.1 Query Types 

Historical graph queries can be categorized along two dimensions:
time and the part of the graph they involve. With
regard to the time dimension, queries can be further distinguished
into point queries and range queries. Point queries
refer to a single point in time, for example, what is the degree
of node $v_i$ at $t_k$, which may correspond to asking for the
number of friends that user $u_i$ had at this specific time in
the past. Range queries refer to a time interval or a set of
time intervals and can be further classified as differential or
aggregate. Differential range queries evaluate how much a
measure changes during a time interval. An example of such a
query is asking how much the degree of node $v_i$ changed in
$[t_k, t_l]$, which may correspond to asking about the change of
popularity of user $u_i$ in this time interval. Finally, aggregate
range queries evaluate an aggregate function over a time interval.
An example of such a query is looking for the average
degree of node $v_i$ in $[t_k, t_l]$, which may correspond to asking
for the average number of friends that user $u_i$ had in this
interval.

With regard to the part of the graph, we distinguish
queries as either node-centric or global queries [4]. Node-centric
queries are queries that involve one or a few nodes
of the graph. The degree query that we have used as an example
for the time dimension is a node-centric query. The
main characteristic of such queries is that their evaluation
does not require traversing the entire graph but only accessing
a subgraph targeted by the query. Other node-centric
queries include neighborhoods, induced subgraphs, and k-core
queries. Global queries are queries that refer to properties
of the entire graph. Example global queries include
PageRank-based queries, the discovery of connected components
and estimating the diameter and the degree distribution.
Table 1 summarizes our query classification.

3.2 Query Processing 

Next, we present different plans for evaluating historical
queries in $[t_0, t_{cur}]$ on a graph $G$ given the current graph
snapshot $SG_{t_{cur}}$ and its delta, $\Delta_{[t_0,t_{cur}]}$.

3.2.1 Two-Phase Query Plan 

A general strategy for evaluating any historical query $q$
for any time point or range in $[t_0, t_{cur}]$ is to reconstruct
the required graph snapshots that are determined by the
query and then evaluate $q$ on them. Thus, the query processing
plan is a two-phase plan that involves (1) a snapshot
reconstruction phase and (2) a query processing phase.
During snapshot reconstruction, backward reconstruction
is applied on $SG_{t_{cur}}$ to acquire the snapshots required for
evaluating $q$. The query processing phase takes as input the
graph snapshots generated by the first phase, evaluates $q$ on
them and combines the results if needed so as to derive the
final query result. This is the most general plan and can be
used to evaluate all types of queries as indicated in Table 2.

For example, for point, node-centric queries that ask for
evaluating a measure $m$ for a node $v_i$ (e.g., $m$ may be
the degree of $v_i$) at time point $t_k$, the two-phase query plan
is defined as follows.



Input: $SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $t_k \in [t_0, t_{cur}]$, $v_i$
Output: $m(v_i)$

$SG_{t_k}$ = BackRec($SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $t_k$)
evaluate $m(v_i)$ on $SG_{t_k}$
return $m(v_i)$



Now, consider a point range query with range $[t_k, t_l] \subseteq
[t_0, t_{cur}]$. For a point differential query (e.g., how much the
degree of $v_i$ has changed in $[t_k, t_l]$), the query plan requires
the construction of two snapshots. Note that the second
snapshot, $SG_{t_k}$, is reconstructed based on the first reconstructed
snapshot, $SG_{t_l}$, to avoid applying the same part of
$\Delta_{[t_0,t_{cur}]}$ twice on the current graph.

Input: $SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $[t_k, t_l] \subseteq [t_0, t_{cur}]$, $v_i$
Output: $d$

$SG_{t_l}$ = BackRec($SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $t_l$)
$SG_{t_k}$ = BackRec($SG_{t_l}$, $\Delta_{[t_0,t_{cur}]}$, $t_k$)
evaluate $m_k(v_i)$ on $SG_{t_k}$
evaluate $m_l(v_i)$ on $SG_{t_l}$
$d = |m_k(v_i) - m_l(v_i)|$
return $d$



Let us now consider an aggregate range node-centric query
(e.g., the average degree of $v_i$ in $[t_k, t_l]$) denoted by $F(m(v_i))$.
This query requires the construction of a snapshot for each
time unit in the time interval so as to compute the average
over all values of $m(v_i)$ in this time range.

Input: $SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $[t_k, t_l] \subseteq [t_0, t_{cur}]$, $v_i$
Output: $F(m(v_i))$

for all $t \in [t_k, t_l]$ do
    $SG_t$ = BackRec($SG_{t_{cur}}$, $\Delta_{[t_0,t_{cur}]}$, $t$)
    evaluate $m_t(v_i)$ on $SG_t$
end for
apply aggregation function $F$ on all $m_t(v_i)$
return $F(m(v_i))$



Similar algorithms can be used for global queries. 

For simplicity, we have assumed that reconstruction uses 
only the current graph snapshot. If materialized snapshots 
are maintained, the only difference is that a selection phase 
is applied before the reconstruction phase. During the se- 
lection phase, we determine the most appropriate snapshot 
to be used for reconstruction and based on this selection 
whether to use forward or backward reconstruction. If more
than one snapshot needs to be reconstructed for query processing,
then a different selection and reconstruction procedure
may be used for each one of them. An interesting
problem in this case is re-using reconstructed snapshots. A 
simple example was shown for point range queries. 

The two-phase query plan can be used to evaluate all types 
of historical queries. However, the reconstruction of snap- 
shots can be costly. Next, we consider alternative plans for 
specific query types that avoid this phase. 



Table 1: Examples of query types

Time / Graph         | Node-centric                                          | Global
Point                | the degree of $v_i$ at $t_k$                          | the diameter of $G$ at $t_k$
Range, differential  | how much the degree of $v_i$ changed in $[t_k, t_l]$  | how much the diameter of $G$ changed in $[t_k, t_l]$
Range, aggregate     | average degree of $v_i$ in $[t_k, t_l]$               | average diameter of $G$ in $[t_k, t_l]$



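Table 2: Query processing

Query type          | Graph part   | Two-phase | Delta-only | Hybrid
Point               | Node-centric | ✓         | -          | ✓
Point               | Global       | ✓         | -          | -
Range differential  | Node-centric | ✓         | ✓          | -
Range differential  | Global       | ✓         | -          | -
Range aggregate     | Node-centric | ✓         | -          | ✓
Range aggregate     | Global       | ✓         | -          | -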

3.2.2 Delta-Only Query Plan 

With delta-only query plans, a query is evaluated directly
on the deltas. No snapshot reconstruction is required.
Furthermore, there is no need to access any of the snapshots.
Such plans are applicable to differential range node-centric
queries (Table 2). For such queries, one can compute how
much a measure has changed by accessing the corresponding
update operations in the delta file for the given time
interval.

For instance, consider a range differential node-centric
query asking for the difference in the degree of node $v_i$ in
$[t_k, t_l]$. This query can be evaluated with a delta-only plan
by counting the $addEdge$ and $remEdge$ operations that
involve $v_i$ in $\Delta_{[t_k,t_l]}$ and taking the difference of the two counts.
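As an illustration, the degree difference can be computed in a single pass over the delta. The sketch assumes a delta stored as a list of `(op, args, t)` tuples and measures the change between the snapshot at $t_k$ and the snapshot at $t_l$, so operations stamped exactly $t_k$ are excluded; both choices are assumptions of this example.

```python
def degree_change(delta, v, t_k, t_l):
    """Net change in the degree of v between time t_k and time t_l."""
    change = 0
    for op, args, t in delta:
        if t_k < t <= t_l and v in args:
            if op == "addEdge":
                change += 1
            elif op == "remEdge":
                change -= 1
    return change


delta = [
    ("addEdge", ("a", "b"), 2),
    ("addEdge", ("a", "c"), 3),
    ("remEdge", ("a", "b"), 5),
    ("addEdge", ("b", "c"), 6),
]
change_a = degree_change(delta, "a", 1, 6)
change_c = degree_change(delta, "c", 1, 6)
```

Node removals need no special handling here, because an invertible delta records a $remEdge$ for every edge of a node before its $remNode$.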

3.2.3 Hybrid Query Plan 

Finally, we consider hybrid plans. Such plans access both
the current snapshot and the delta, but do not require the
reconstruction of any graph snapshots.

These plans are applicable to point and aggregate range
node-centric queries (Table 2). For instance, consider an aggregate
range node-centric query, e.g., asking for the average
degree of $v_i$ in $[t_k, t_l]$. The hybrid plan evaluates the degree
of $v_i$ on $SG_{t_{cur}}$ and then traverses $\Delta_{[t_k,t_{cur}]}$ backwards to compute the
degree at each time unit in the requested range. Then, the
average of these degrees is computed.
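A possible sketch of this hybrid plan is given below. It assumes integer time units, a delta stored as a time-ordered list of `(op, args, t)` tuples with no timestamp later than `t_cur`, and that the degree of the node in the current snapshot is already known; the function name `average_degree` is hypothetical.

```python
def average_degree(current_degree, delta, v, t_k, t_l, t_cur):
    """Average degree of v over the time units t_k..t_l (inclusive)."""
    degree = current_degree  # degree of v in the current snapshot
    pos = len(delta) - 1
    samples = []
    for t in range(t_cur, t_k - 1, -1):  # walk time units backwards
        if t <= t_l:
            samples.append(degree)  # degree of v at time unit t
        # Undo the operations stamped t to obtain the degree at t - 1.
        while pos >= 0 and delta[pos][2] == t:
            op, args, _ = delta[pos]
            if v in args:
                if op == "addEdge":
                    degree -= 1
                elif op == "remEdge":
                    degree += 1
            pos -= 1
    return sum(samples) / len(samples)


delta = [
    ("addEdge", ("a", "b"), 1),
    ("addEdge", ("a", "c"), 2),
    ("remEdge", ("a", "b"), 4),
]
avg = average_degree(1, delta, "a", 1, 4, 5)
```

Only the degree counter is maintained while walking the delta, so no snapshot is built at any point.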

Note that there are cases in which more than one pass
of $\Delta_{[t_k,t_l]}$ may be required to evaluate a node-centric
measure. For instance, consider a query for the average
degree of the induced subgraph of $v_i$ in $[t_k, t_l]$, where the
induced subgraph of $v_i$ is the subgraph formed by $v_i$ and its
neighbors. By traversing the delta, we may add new nodes
in the subgraph and therefore need to go back and include
edges of these specific new nodes that have not been included
initially, as they were not part of the original subgraph.

3.3 Delta Indexing and Optimizations 

As pointed out, snapshot reconstruction is the most costly 
phase in our query plans and we would rather avoid it when- 
ever possible. For queries for which reconstruction cannot 
be avoided, we present techniques that improve its efficiency. 

3.3.1 Partial Reconstruction 

The difference between node-centric and global queries
is that node-centric queries do not require traversing the
entire graph but are targeted to one or a few nodes. Let
$G' = (V', E')$ be the subgraph of $G$ that a node-centric
query $q$ needs to access. Then, instead of reconstructing
the snapshots $SG_t$ that $q$ requires of the entire graph $G$,
it suffices to reconstruct the corresponding snapshots of the
subgraph $G'$. During snapshot reconstruction, all add and
remove operations involving nodes and edges such that
$v_i \notin V'$ and $(v_i, v_j) \notin E'$ are ignored. Multiple passes of the delta
may be required to determine all elements to be included in
the snapshots.
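For a query that only needs the neighborhood of a single node, partial backward reconstruction can be sketched as follows. The delta encoding as a time-ordered list of `(op, args, t)` tuples and the function name `partial_back_rec` are assumptions of this example.

```python
def partial_back_rec(current_neighbors, delta, v, t_prime):
    """Rebuild only the neighbor set of v as it was at time t_prime."""
    neighbors = set(current_neighbors)
    for op, args, t in reversed(delta):  # newest operation first
        if t <= t_prime:
            break
        # Operations that do not touch v are ignored entirely.
        if op in ("addEdge", "remEdge") and v in args:
            other = args[0] if args[1] == v else args[1]
            if op == "addEdge":
                neighbors.discard(other)  # undo the addition
            else:
                neighbors.add(other)      # undo the removal
    return neighbors


delta = [
    ("addEdge", ("a", "b"), 1),
    ("addEdge", ("c", "d"), 2),
    ("addEdge", ("a", "c"), 3),
    ("remEdge", ("a", "b"), 4),
]
neighbors_at_2 = partial_back_rec({"c"}, delta, "a", 2)
```

The degree of the node at the requested time point is then simply the size of the returned set.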

3.3.2 Indexing 

To further improve efficiency during reconstruction, we
propose building indexes on the delta. Indexing also improves
delta-based query plans by enabling faster access to
specific parts of the delta. Indexing may improve performance
significantly, especially considering that the size of a
delta grows constantly through time.

Temporal Index. Snapshot reconstruction and query processing
usually require applying or accessing a part $\Delta_{[t_k,t_l]} \subseteq
\Delta_{[t_0,t_{cur}]}$ of the delta file to reconstruct graph snapshots or
evaluate a query. Therefore, using a temporal index improves
the efficiency of these procedures by enabling faster
access to the desired parts of the delta.
Node-centric Index. Besides temporal indexing, another 
option is to apply node-centric indexing to enable the ef- 
ficient location of all operations associated with a specific 
node. A node-centric index improves the evaluation of node- 
centric queries for all three query plans that we have dis- 
cussed. It also facilitates partial reconstruction. 
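
Both kinds of index can be sketched over an in-memory delta as shown below. The sketch assumes a time-ordered list of `(op, args, t)` tuples; using a sorted timestamp list with binary search as the temporal index and a dictionary of positions as the node-centric index is one possible design, not necessarily the one used in our implementation.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_indexes(delta):
    """Build a temporal index and a node-centric index over the delta."""
    times = [t for _, _, t in delta]  # sorted, because the delta is time-ordered
    by_node = defaultdict(list)       # node -> positions of its operations
    for pos, (_, args, _) in enumerate(delta):
        for v in args:
            by_node[v].append(pos)
    return times, by_node


def ops_in_range(delta, times, t_from, t_to):
    """All operations with a timestamp in [t_from, t_to]."""
    return delta[bisect_left(times, t_from):bisect_right(times, t_to)]


def ops_for_node(delta, by_node, v):
    """All operations that involve node v, in time order."""
    return [delta[pos] for pos in by_node[v]]


delta = [
    ("addNode", ("a",), 1),
    ("addNode", ("b",), 1),
    ("addEdge", ("a", "b"), 2),
    ("addNode", ("c",), 3),
    ("addEdge", ("a", "c"), 4),
]
times, by_node = build_indexes(delta)
middle = ops_in_range(delta, times, 2, 3)
for_a = ops_for_node(delta, by_node, "a")
```

With the node-centric index, a degree query for one node touches only that node's operations instead of scanning the whole delta.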

4. PRELIMINARY EVALUATION 

The efficiency of processing historical queries depends on 
the underlying storage model. In our initial implementa- 
tion, deltas are stored in append-only files. Any materialized 
graph snapshots are stored in a native graph database. We 
also maintain an in-memory node-centric index on the delta 
file. The use of a native graph database results in faster ex- 
ecution of graph queries, but it does not support any form 
of locality. 

The goal of this preliminary evaluation is to present some 
initial quantitative results regarding the efficiency of the pro- 
posed query plans. To this end, we run a node-centric query 
that asks for the degree of a random node v at time point 
t. We evaluate the following four different plans: (a) a two- 
phase query plan without indexing (two-phase), (b) a hybrid 
query plan without indexing (hybrid), (c) a two-phase approach
with indexing (two-phase-index) and (d) a hybrid
approach with indexing (hybrid-index). For the two-phase
approach, we used partial reconstruction. We also used only 
the current snapshot and backward reconstruction (no ad- 
ditional materialized snapshots were used). 

We generate graphs that are scale-free, in an effort to
mimic the form of online social network graphs. To generate
scale-free graph snapshots, we use the method in [11] that
extends the Barabasi algorithm [1] for generating successive
scale-free graphs. Table 3 summarizes the characteristics of
the synthetic dataset. All algorithms are implemented in
Java. The experiments were run on a Linux machine with a
2.8GHz dual-core Intel processor and 4GB of memory. As our native
graph database, we used Neo4j (http://neo4j.org).



Table 3: Synthetic dataset

Figure 1: Run time in ms for executing a degree
query at different time points (time measured in operations).

Figure 1 reports the run time in milliseconds for executing
the query at different time points. The time on the
x-axis (measured in number of operations) proceeds backwards
(i.e., point 0 corresponds to the current snapshot).
The further back in time we go from the current snapshot, the more
expensive it is to evaluate the query, since reconstructing the
past snapshot requires the application of more operations.
The two-phase algorithm takes the most time,
due to the cost of the reconstruction phase. This phase
is especially expensive, since in Neo4j, any modification to
stored data is associated with a transaction and is flushed
directly to disk. The usage of a node-centric index on the
delta file leads to significant gains for both the two-phase
and the hybrid approach.

5. RELATED WORK 

There is a large body of work on temporal data management
including relational databases (see, for example, [10]
and [12] for excellent surveys on the topic), RDF (e.g., [5])
and XML documents (e.g., [8], [2]). Although maintaining
deltas has also been used in such cases, the large scale and
the logical model, being in our case a graph, introduce new
problems.

Collecting a sequence of versions of XML documents from 
the web is considered in [8]. The difference between two 
consecutive versions is computed and represented by com- 
plete deltas based on persistent identifiers assigned to each 
XML node, while only the current version of the document 
is maintained. To avoid the overhead of applying deltas to
retrieve previous versions, the authors of [2] merge all versions of
XML data into one hierarchy where an element appearing
in multiple versions is stored only once along with a timestamp.
To handle temporal RDF data, temporal reasoning is
incorporated into RDF in [3], thus yielding temporal RDF 
graphs. Semantics were also defined for these graphs which 
include the notion of temporal entailment as well as a syntax 
to incorporate this framework into standard RDF graphs by 
adding temporal labels. Clearly, our approach is different in 
that it considers time with respect to graph evolution. 

Numerous algorithms and data structures have been pro- 
posed for processing graph queries on large graphs. GBASE 
[4] and Pregel [7] are two general graph management sys- 
tems that work in parallel and distributed settings and sup- 
port large-scale graphs for various applications. GBASE is 
based on a common underlying primitive of several graph 
mining operations, which is shown to be a generalized form 
of matrix-vector multiplication [5], while Pregel is based
on a sequence of supersteps that are applied in parallel by
each node executing the same user-defined function that
expresses the logic of a given algorithm and are separated by
global synchronization points. In future work, we plan to
explore such techniques for reconstructing snapshots in parallel.

The most relevant to our work is perhaps the historical
graph structure recently proposed in [11]. The authors consider
a sequence of graphs produced as the graph evolves
over time. Since the graphs in the sequence are very similar 
to each other, they propose computing graph representatives 
by clustering similar graphs and then storing appropriate 
differences from these representatives, instead of storing all 
graphs. Our approach is different in that we want to sup- 
port a broad range of historical queries, not just queries that 
involve a single snapshot graph. 

There is also a large body of work that studies the evolution
of real-world networks over time. In [6], it was shown
that for a variety of real-world networks, graph density
increases following a power law and the graph diameter shrinks.
Works on specific networks such as Flickr [9] and Facebook
[13] also study network growth and their results can be
exploited to enrich our model.

6. CONCLUSIONS 

In this paper, we presented a model for capturing graph 
evolution through time based on the use of graph snapshots 
and deltas. We showed how by maintaining only the current 
graph snapshot and a delta, we can reconstruct any past 
graph snapshot. Then, we introduced a general two-phase
query plan based on snapshot reconstruction to evaluate any 
historical query as well as a couple of more efficient plans 
that avoid reconstruction for specific queries. Finally, we 
presented preliminary experimental results. 